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1 Introduction 


Estimating the level sets of a probability density function has a wide range of appli¬ 
cations, including anomaly detection (outlier detection) (Breunig et ah, 2000; Hodge 
and Austin, 2004), two-sample comparison (Duong et al., 2009), binary classifica¬ 
tion (Mammen and Tsybakov, 1999), and clustering (Rinaldo and Wasserman, 2010; 
Rinaldo et ah, 2012). In this paper, we study the problem of estimating the level set 

D h = D h { A) = {x : p h (x) = A} , (1) 

where Ph is the expected kernel density estimator with bandwidth h, a smoothed 
version of the underlying density p. Using D h (and thus ph) has several advantages, 
which we discuss in detail in Section 2.2. Figures 5 and 10 illustrate the kind of 
confidence sets and visualizations we will develop in this paper. 

A commonly used estimator of the density level-set is the plug-in estimator D^ = 
{x : Ph(x) — A}, where ph is the kernel density estimator or some other density esti¬ 
mator. There is a large literature for level sets (and upper level sets, which replace 
= A with > A) that focuses on the consistency, rates of convergence (Polonik, 1995; 
Tsybakov, 1997; Walther, 1997; Cadre, 2006; Cuevas et ah, 2006) and minimaxity 
(Singh et ah, 2009) of such estimators under various error loss functions. 

Recent results on statistical inference for level sets include Jankowski and Stan- 
berry (2012) and Mammen and Polonik (2013). Statistical inference is challeng¬ 
ing in this setting because the estimand is a set and the estimator is a random set 
(Molchanov, 2005). Mason and Polonik (2009) establish asymptotic normality for 
upper level sets when the loss function is the measure of the set difference. However, 
it is unclear how to derive a confidence set from this result. 

Another challenge of level set estimation is that we cannot directly visualize the 
level sets when the dimension of the data d is larger than 3. One approach is to 
construct a level-set tree, which shows how the connected components for the upper 
level sets bifurcate when we gradually increase A (Stuetzle, 2003; Klemela, 2004, 
2006, 2009; Stuetzle and Nugent, 2010; Kent et ah, 2013). The level-set tree reveals 
topological information about the level sets but loses geometric information. 

In this paper, we propose solutions to all of these problems. Our main contribu¬ 
tions can be summarized as follows. 


2 


1. We derive the limiting distribution of Haus (Dh,Dh) (Theorem 3). 

2. We develop two bootstrap-based methods to construct confidence regions for 
D h (Section 4). 

3. We prove that both bootstrap methods are valid (Theorem 4 and 5). 

4. We devise a visualization technique that preserves the geometric information 
for density level sets (Section 5). 

Related Work. Early work on density level set focuses on proving the consistency 
or the rate of convergence under various metrics. See e.g. Polonik (1995); Tsybakov 
(1997); Walther (1997); Cuevas et al. (2006); Rinaldo and Wasserman (2010). How¬ 
ever, none of these derives a limiting distribution for the density level sets. To our 
knowledge, the only paper that considers limiting distributions is Mason and Polonik 
(2009), proving asymptotic normality under a generalized integrated distance. How¬ 
ever, this metric cannot be used to construct a confidence set for density level sets 
since the asymptotic distribution involves the true density, which is unknown. Es¬ 
timating the level set is also related to support estimation, see e.g. Cuevas and 
Rodrfguez-Casal (2004) and Cuevas (2009). 

Jankowski and Stanberry (2012) and Mammen and Polonik (2013), both provide 
methods for constructing confidence sets for the density level sets using the variation 
of the density function. Our approach is similar to theirs but is based on Hausdorff 
distance. We will compare their methods to ours in Section 4.2. 

Outline. We begin with a short introduction to density level sets along with 
some useful geometric concepts in Section 2. In Section 3, we derive the limiting 
distribution of the Hausdorff distance between the estimated and true level sets. In 
Section 4, we construct a valid confidence set for density level sets. In Section 5, we 
devise a visualization method for density level sets that is simple to interpret and 
efficient to compute. (We provide an R package that implements our visualization 
method.) We summarize our results and discuss related problems in Section 6. 
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2 Technical Background 

2.1 Level Sets 

Let X], - ■ ■ , X n be a random sample from an unknown, continuous density p(x). We 
define the density level set by 

D = D(X) = {x : p(x) = A} (2) 


for some A > 0. Note that in the literature the term level sets is sometimes used for 
the set {x : p(x) > A}; we call the latter the upper level set to distinguish it from the 
level set in equation 2. Thus, the level sets (in our terminology) are the boundaries to 
the upper level sets, under mild smoothness assumptions (e.g., assumption G below). 

We assume that A is a fixed, positive value. A plug-in estimate for D is 

D h = D h ( A) = {x : p h (x) = A}, (3) 


where p/, is the kernel density estimator (KDE), 

1 n 

= 2 > 
i —1 




(4) 


2.2 Smoothed Density Level Set 


In this paper, we focus on inference for the level sets of a smoothed version of p, 
specifically: 

Ph =p* K h = E^), (5) 

where K h (x) = ppK and * denotes convolution. We denote the A level set of 

Ph by 


D h = {x : p h {x) = A} . 


( 6 ) 


Note that although we focus on estimating D we allow h = h n —>■ 0 as n —> oo. 

Here, we argue that ph (and hence D h ) is a better target for level-set estimation 
than p (and hence D). For simplicity, in this section we focus on the upper level sets 
Lh(\) = {x : p h (x) > A} and L( A) = {x : Ph(x) > A}, but the gist of the argument 
remains the same in either case. 

Our arguments are: (i) ph always exists while p may not even exist; (ii) ph can 
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be naturally viewed as an estimator for p h ; (iii) p h has all the salient structure of p 
but is easier to estimate than p\ (iv) While the bias Ph(x) — p(x) may be analyzed 
theoretically, in practice, it cannot be accurately estimated. It is better to just be 
clear that we are really estimating ph- 

Let us now expand on some of these points. Regarding point (ii): Let Xi,, X n 
be an iid sample from P, where P is supported on some compact set IK C M. d . We 
have, for any P, 

I Phix) -p h (x) j > e j < de~ C2nhde2 . 

As long as h > (log n/n) 1 ^, P n (sup a , | Ph(x) — ph{x) \ > e) —* 0, and hence we can 
uniformly consistently estimate ph, whether we keep h fixed or let it tend to 0. The 
same is not true for p. In fact, P may not even have a density p. The bias cannot be 
uniformly estimated in a distribution-free way. 

Regarding (iii): The left plot in Figure 1 shows a density p. The blue points at 
the bottom show the upper level set L = {x : p > 0.05}. The right plot shows ph for 
h — 0.2 and the blue points at the bottom show the upper level set Lh = {x : Ph > 
0.05}. The smoothed out density ph is biased and the upper level set Lh. loses the 
small details of L. But these small details are the least estimable, and Lh captures 
the principal structure of L. In addition, when L is smooth (having positive //-reach; 
Chazal and Lieutier 2005; Chazal et al. 2009) and h is sufficiently small, Lh and L 
will be topologically similar, in the sense that a small expansion of Lh is homotopic to 
L (Chazal and Lieutier 2005; Chazal et al. 2009; Genovese et al. 2014). The /i-reach 
and the concept of being nearly homotopic can be found in section 3.2 of Genovese 
et al. (2014) and section 4 of Chazal et al. (2009). 

As a second example, let P = (l/3)0(:r; —5,1) + (l/3)ho + (l/3)0(a;; 5,1) where 0 
is a Normal density and ho is a point mass at 0. Of course, this distribution does not 
even have a density. The left plot in Figure (2) shows the density of the absolutely 
continuous part of P with a vertical line to who the point mass. The right plot shows 
Ph, which is a smooth, well-defined density. Again the blue points show the level sets. 
As before ph and L h are slightly biased, but they are also well-defined. And ph and L h . 
are accurate estimators of ph and Lh, respectively. Moreover, Lh captures the most 
important qualitative information about L, namely, that there are three connected 
components, one of which is small. These examples show that ph — and hence Dh 


P n sup 
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Figure 1: Left: a density p and an upper level set {p > t}. Right: the smoothed 
density ph and the upper level set {ph > t}. 

— is a sensible target of inference. 

Lastly, we would like to point out that the idea of viewing p h as the estimand is 
not new. The “scale space” approach to smoothing explicitly argues that we should 
view ph as an estimate of ph, and ph is then regarded as a view of p at a particular 
resolution. This idea is discussed in detail in Chaudhuri et al. (2000); Chaudhuri and 
Marron (1999); Godtliebsen et al. (2002). 

If one really wants to focus on making inference for D , the level set of the original 
density, then we need to undersmooth 1 so that the bias will not affect the limiting 
distribution. This leads to estimates of D that are highly variable. We believe that 
an accurate confidence set for Dh is more useful than a poor confidence set for D. 

Remark 1. These arguments explain why we regard ph rather than p as the estimand. 
But these arguments do not tell us how to choose h. Bandwidth selection is always 
a challenge and in this paper we mainly use Silverman’s rule of thumb (Silverman 
(1986)). 

1 Undersmoothing the density estimate to make statistical inferences is a common practice in 
nonparametric statistics; see e.g. page 89 of Wasserman (2006). 
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Figure 2: Left: a distribution with a continuous component and a point mass at 0. 
Right: the smoothed density ph- The upper level set Lh is biased but is estimable 
and it approximates the main features of L. 

2.3 Geometric Concepts 

Let tta(x) be the projection of a point x onto a set A. Note that tta(x) may not be 
unique. The distance induced by the projection is 


d(x,A) = inf{||x - y || 2 :yeA} = \\x- ^ a { x )\\ 2 (7) 


A common measure of distance between two subsets of a metric space (e.g., M d ) 
is the Hausdorff distance , given by 


Haus(A, B) — infje : A C B © e and B C A © e} 


= max 


< sup d(x, A), sup d(x, B) 

x£A 


( 8 ) 


where A©e = [j xeA B(x,e) and B(x,e) = {y : ||x — y\\ < e}. The ffausdorff distance 
is a generalized version of the metric for sets. 

The reach of a set M (Federer 1959; Cuevas 2009, also known as condition num¬ 
ber Niyogi et ah 2008 or minimal feature size Chazal and Lieutier 2005) is the largest 
distance from M such that every point within this distance to M has a unique pro¬ 
jection onto M. i.e. 


reach(M) = sup{r : tt m (x) is unique Vx G M © r}. 


(9) 
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Figure 3: An illustration for reach. The reach is the largest radius for a ball that can 
freely move along the set M. In (a), the radius of the pink ball is equal to the reach. 
In (b), the radius is too large so that it cannot pass the small gap on M. 




Figure 4: An example for two normal compatible curves. Panel (a): each thin red 
line indicates the projection from a point of M onto M'. Panel (b): each thin black 
line indicates the projection from a point of M' onto M. When these projections are 
one to one and onto, we say M and M' are normal compatible to each other. 


Note that 7 ta(x) is unique if 0 < d(x,A ) < reach(A). Another way to understand the 
reach is as the largest radius of a ball that can freely move along M; see Figure 3 for 
an example. In some cases, the reach is the same as the smallest radius of curvature 
on M. The reach plays a key role in relating the Hausdorff distance to the empirical 
process. Note that the reach is closely related to ‘rolling properties’ and ‘a-convexity’; 
see Cuevas (2009), Cuevas et al. (2012) and appendix A of Pateiro-Lopez (2008). 

Finally, two smooth sets A and B are called Jiormal compatible (Chazal et ah, 
2007) if the projection between A and B are one to one and onto; see Figure 4 for 
an example. When A and B are normal compatible, the Hausdorff distance has the 
simpler form 

= sup d(x, A) = sup d(x, B ). 

xEB xEA 
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Haus(A, B ) 


( 10 ) 




3 Asymptotic Theory 


In this section, we derive the asymptotic theory for Haus(D/ l , Dh). Note first that for 
two sets A and B , the Hausdorff distance satisfies the inclusion property 


A C B © Haus(A, B), B C A © Haus(A, B). (11) 


From this, it follows that when we have a set estimator A n and a parameter of interest 
A, then for any a > 0, the set 


A n © Quantile ^Haus(A n , A), 1 — aj 


( 12 ) 


is a 1 — a confidence set for A, where Quantile(A, a) is the a-quantile of random 
variable A". Thus, whenever we can approximate the distribution of Haus(A n , A), we 
can construct a confidence set for A. Note that Chen et al. (2014b) and Chen et al. 
(2014a) have used this property to construct confidence sets, but neither paper used 
this property to full effect. 

We now define some notation that will be used throughout this paper. Let BC r 
denote the collection of functions (including both univariate and multivariate func¬ 
tions) with bounded continuous derivatives up to the r-th order. For a smooth func¬ 
tion / : R d i->- R with / G BC r , we denote the elementwise max norm for r-th 
derivative as ||/|| rimax . For instance, 

II f(x) 111 , max = max 
l<i<d 

We define the sup norm using derivatives up to r-th order by: 


df(x) 


dxi 


|| f(x) II 2 ,max = HiaX 
1 <i,j<d 


d 2 f(x) 


dxidxi 


ll/llr,max = max j sup \\f(x) ||^ max : l = 0, • • • ,r|. (13) 

f zeK J 

Let a = (cti, • • • , ad) be an multi-index such that each oij is a non-negative integer 
and |a| = + • • • + We define 



0W 

d ai x\ ■ ■ ■ d ad Xd 


f{*) 


to be the partial derivative. Note that for a vector v G R d , the norm ||r|| denotes the 
usual Euclidean norm. Throughout this paper, we use the conventional notation for 
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Op(-) and O(-) for n —> oo and h = h n , where possibly h n —>■ 0. 

We now state are main assumptions, for an arbitrary density q. When we use the 
assumptions in what follows, we will take q to be ph- 

Assumptions. 

(G) Let D(q) = {x G IK : q(x) = A} be the level set for a density q. There are 
S 0 ,g 0 > 0 such that \/x G D(q) © <5 0 , we have ||Vg(x)|| > g 0 . 

(Kl) The kernel function K G BC 3 and is symmetric, non-negative, and 
j x 2 K^(x)dx<oo, j (R^(x)fdx<oo 
for all multi-indices a satisfying |a| < 3. 

(K2) The kernel function K and its partial derivative satisfies condition K\ in Gine 
and Guillou (2002). Specifically, let 

/C = | K {a) ■ iGl d ,/i> 0, |af| < 2j* (14) 

We require that /C satisfies 

supAf (K, i 2 (P), £||P|U s(f) ) <0) (15) 

for some positive numbers A and v, where N(T,d,e ) denotes the e-covering 
number of the metric space ( T,d ), F is the envelope function of /C, and the 
supremum is taken over the whole The A and v are usually called the VC 
characteristics of 1C. The norm ||F||| 2 ^ = j \F(x)\ 2 dP(x). 

Assumption (G) appears in Molchanov (1991); Tsybakov (1997); Walther (1997); 
Molchanov (1998); Cadre (2006); Mammen and Polonik (2013); Laloe and Servien 
(2013). For a smooth density q, (G) holds whenever the specified level A does not 
coincide with the density value for a critical point. 

Assumption (Kl) is to guarantee that the variance of the KDE is bounded and to 
ensure that ph G BC 3 . This assumption is very common in statistical literature, see 
e.g. Wasserman (2006). Assumption (K2) is to regularize the complexity of the kernel 
function so that the supremum norm for kernel functions and their derivatives can be 
bounded in probability. Similar assumption appears in Einmahl and Mason (2005) 
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and Genovese et al. (2014). The Gaussian kernel and many compactly supported 
kernels satisfy both assumptions. 

An immediate result from assumption (G) is the smoothness of the density level 
set. This smoothness property will be used to understand the distribution of Haus(ll/ l , D h ). 

Lemma 1 (Smoothness Theorem). Assume a density ph G BC 2 satisfies condition 
(G) and let D h denote the level set for p h at X. Then 

reach (D h ) > min ( -r — ^ - 1 . (16) 

l 2 II2,max J 

Moreover, let q G BC 3 be another density function and define D(q ) as the level set 
for q at level X. When ||pu — <?|| 2 ,max sufficiently small, 

1. Condition (G) holds for q. 

2. reach (D(q)) = min |f, } + 0(\\p h - g||£ imax ). 

3. Dh and D(q) are normal compatible. 


The proof is given in the supplementary materials. Lemma 1 is very similar to 
Theorem 1 and 2 in Walther (1997). Essentially, this lemma shows that the level 
set D is smooth and that whenever two smooth densities are sufficiently close, their 
level sets will both be smooth, be close to each other, and have one-to-one and onto 
normal projections between them. 

Given a collection of functions J- = {ft : i— > M : t G T}, where T is some index 

set, the empirical process is defined as 

1 n 

g »(/) = ^T(/(-V)-E(/(A' 1 ))), /e J. (17) 

\/n ^ 


Lemma 2 (Empirical Approximation). Assume (K1-K2) and (G) hold for ph- Let 
Dh and Dh be the density level sets with level X for ph and fih ■ Define the function 


fx(y) 


1 

V^||Vp h (a;)|| 


K 


f x-y \ 

V h J 


with x G Dh- If nh d + 2 —• y 0 , h —¥ 0, then 


(18) 


sup 

x&D h 


Gn(fx) ~ Vnhf ■ d(x, D h ) 
\/nh d ■ d(x, Dh) 


0(\\p h -Ph\\* hmax ) = Oh 



(19) 
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A key element for the proof of Lemma 2 is the smoothness of Dh and Dh , which 
relies on Lemma 1. This smoothness allows us to approximate the local difference by 
an empirical process. 

Remark 2. Lemma 2 implies that for each x e Dh, d(x,Dh ) converges to a mean 
0 Gaussian process. We can use E (d(x, Dh) 2 ^j as a measure of local uncertainty 
Chen et al. (2014b) - analogous to the mean squared error - and thus can apply the 
bootstrap to estimate this quantity. 

Lemma 2 shows that the projected distance to the level sets can be approxi¬ 
mated by a stochastic process (the empirical process) defined on a smooth manifold. 
Specifically, Lemma 2 shows that the projection distance can be approximated by an 
empirical process on certain functions f x , where x € Dh- The level sets Dh now acts 
as an index set. Thus, we define the function space 

D={ f x (y) = 1 - K () : x e D h \ 

\ Vhd\\Vp h (x)\\ V h J ] 

and define a Gaussian process B on T such that for all fi, f 2 G T, 

B(/i) = JV(0,E(/ 1 (X 1 ) 2 )), Coy(B(/ 1 ),B(/ 2 ))=E(/ 1 (X 1 )/ 2 (X 1 )). 


( 20 ) 


( 21 ) 


Theorem 3 (Asymptotic Theory). Assume (K1-K2) and (G) holds for ph- Let Dh 
and Dh be the density level sets with level A forph andph- Then when ^+2 —* 0, h —> 
0, the Hausdorff distance satisfies 


sup 

t 


P (\/nh d Haus (D h , D h ) < t j — P 


sup 

/e^ 


< t 


= 0 


log 7 n 

nh d 


1 / 8 ' 


( 22 ) 

where T is defined in equation (20) and B is a Gaussian process defined on J 7 satis¬ 
fying equation (21). 


The proof of Theorem 3 depends on two geometric observations. First, the em¬ 
pirical approximation in Lemma 2, shows that the local difference is approximately 
the same as an empirical process, and hence the maximum local difference is approx¬ 
imated by the maximum of the empirical process. Second, the normal compatibility 
between Dh and Dh guaranteed by Lemma 1, which implies that maximal of local 
difference is the same as the Hausdorff distance. 
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Theorem 3 shows that the Hausdorff distance Haus(Z5/ l , D /J can be approximated 
by a maximum over a certain Gaussian process. Note that we cannot directly use this 
theorem to construct a confidence set for Dh since the Gaussian process is defined on 
Dh, which is unknown. Later we will use the bootstrap to approximate this limiting 
distribution and construct a confidence set. 

The random variable sup^gj- |B(/)| follows an extreme-value type distribution. 
However, writing down the explicit form for this distribution is not very helpful in 
statistical inference because it involves unknown quantities and because the conver¬ 
gence to the distribution is notoriously slow. Instead, we will use the bootstrap to 
approximate the distribution of Ha us (Dh,Dh), avoiding the unknown quantities and 
yielding much faster convergence. 

4 Statistical Inference 

We now show that we can construct valid confidence sets for Dh with the bootstrap. 
A set S Ut i_ a is called an asymptotically valid confidence set for D h if 

F{D h C S'n.i-o) = 1 - a + 0(r n ), (23) 

where r n —» 0 as n —> oo. We propose two methods for constructing a confidence set, 
and we will show that they are both asymptotically valid. 

4.1 Method 1: Hausdorff Loss 

The first approach is to use the Hausdorff distance between the level sets. Let W n = 
Haus (Dh,Dh) and define 

Wi- a = F Wn ( 1 — a), (24) 

where Fa denotes the cdf for a random variable A. Then, it is easy to see that 

P (D h C D h ® wi- a ) > 1 - a. (25) 


We use the bootstrap to estimate w i_ Q . 

Let • • • , X* be a bootstrap sample from the empirical measure. Let p* h denote 
the KDE using the bootstrap sample, and D* the corresponding level set. We define 
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W* = Y\aus(D* nl D h ) and 


(26) 


wi- a = F w \(l - a), 

Then the bootstrap confidence set is Dh © w\- a . 

Theorem 4. Assume (K1-K2) and (G) holds for ph and ^f d + 2 —> 0, h — * 0. Let Dh 
and D h and D* n be the density level set with level X for p h and p h and p* h . Then there 
exist X n such that 


snp 

t 


P (y/nh* Haus (p* n ,Dh) 


< t \ X i, — 



— P ^ Vnh d Haus(Dh , Dh) < t'j 
= 0 (dlPh - Ph||max) 1/8 ) (27) 


for all (Xl, • • ■ , X n ) e X n and P(T n ) > 1 — 3e nhd+2A o f or some constants A 0 . Thus, 


P 


D h C D h 


Wi-c 


1 — a + O 



(28) 


An intuitive explanation for Theorem 4 is that as n goes to infinity, the bootstrap 
process converges to the same Gaussian process as the empirical process - thus, they 
share the same Berry-Esseen bound. 

4.2 Method 2: Supremum Loss 

The second approach is to use the supremum norm of the KDE and impose an upper 
and lower bound around the density level. 

Let M n = sup^gu \ph(x) - Phix) | and mi_ a = F^( 1 - a). Define 

C n ,i- a = {x G K : | p h (x) - X\< mi_ a } . (29) 

It is easy to verify that 

P (D h C C n>1 - a ) >l-a. (30) 

Again we use the bootstrap to estimate the quantile. Recall that p* h is the KDE 
based on the bootstrap sample. We define M* = sup xeIK | p* h {x) — Ph(x)\ and set 

fhi- a = F~l(l - a). (31) 
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Then the confidence set is 


C n ,i-a = {i6l: \ph(x) - A| < mi_a}. (32) 

Theorem 5. Assume (K1-K2) and (G) holds for p h and —> 0, h —* 0. Let D h 
and Dh and D* be the density level sets with level A for ph and ph and p* h . Then 


The proof of this Theorem is similar to the proof of Theorem 4, so we omit the 
details. The basic idea is as follows. By equation (30), the quantile of M n can be 
used to construct confidence sets. We then show that Vnh d M n is approximated by 
the maximum of a Gaussian process (similar to Theorem 3 and made explicit in 
Chernozhukov et al. 2014a). Finally, we show that the bootstrap \Jnh d Mf converges 
to a/ nh d M n as in Theorem 4. 

This method embodied in Theorem 5 is very similar to the methods in Jankowski 
and Stanberry (2012) and Mammen and Polonik (2013). Jankowski and Stanberry 
(2012) proposes to construct a confidence set of the form 

C n = {x : A - i n < p h (x) < A + r n } (33) 

with some £ n , r n —* 0. They require that Vnh d (ph ~ Ph) converges weakly to a known 
random field. This convergence is true when h is fixed but is not attainable if we 
allow h = —)■ 0, in contrast to our approach, which supports both cases. They also 
assume the limiting random field is either known or can be easily estimated; we do 
not require that assumption. 

Mammen and Polonik (2013) construct a confidence set using a similar approach 
to method 2, but they focus on the original level set D = {x : p(x) = A} rather than 
the smoothed version Dh- Instead of taking the supremum of the density deviation 
over the whole support IK, they propose to focus on the regions x € DADh- he. 

R n = sup^ \p h (x)-p\, (34) 

x^DADh 

where AAB = {x : x G A,x B} U {x : x E B,x A} is the symmetric difference 
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between sets. Then they use the upper quantile of R n to construct a confidence set of a 
similar form to (29) and apply the bootstrap to estimate the quantile. Their bootstrap 
consistency relies on Neumann’s method (Proposition 3.1 in Neumann (1998)) and 
they assume that h converges fast enough so that one can ignore the bias for estimating 
the original density p. Actually, under their assumptions, our proposed bootstrap 
confidence sets (from both methods 1 and 2) are also consistent for the original level 
set D since the bias converges faster than the stochastic variation. The method in 
Mammen and Polonik (2013) should have higher power than our method 2 since they 
consider taking the supremum over a smaller region. 

Remark 3. We may use a variance stabilizing transform to obtain an adaptive confi¬ 
dence set using similar idea to Chernozhukov et al. (2012). The variance of Ph.(x) is 
proportional to p(x). Thus, we may use 


v' = sup AMzjM 

zsk y/Ph(x ) 


Pi_ Q = F v * (1 - a) 


(35) 

(36) 


and set x \Jph{x) as an adaptive threshold for constructing the confidence set. 
Namely, the adaptive confidence set is given by 


C* } i_ a = jx G K : \p h (x) - A| < Pi_ Q x a Jp h {x)^ . (37) 

Using the same approach as in the proof to Theorem 5, we can show that C* 1 _ a has 
asymptotically 1 — a coverage. 

Remark 4. The rate O may not be optimal. In Chernozukov et al. 

(2014), they apply a induction technique that gives a rate of order n _1//b for the 
Gaussian approximation. Despite not being mentioned explicitly in that paper, we 
believe that similar technique applies to the empirical process. If this is true, the rate 
in Theorem 3, 4 and 5 can be further refined to O 



4.3 Comparing Methods 1 and 2 

Both methods 1 and 2 generate confidence sets with asymptotically valid coverage. 
Figure 5 compares the 90% confidence sets from both methods on the old faithful 
dataset. Apparently, method 1 (left panel; blue regions) is superior to method 2 (right 
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Figure 5: An example of 90% confidence sets using Hausdorff loss of level set (method 
1; blue regions in left panel) and the sup remum loss of the density (method 2; gold 
regions in right panel). This dataset is the old faithful dataset. As can be seen easily, 
the supremum loss of density (right panel) is too huge so that it contains a wide 
regions as the confidence set. On the other hand, Hausdorff loss of level sets (left 
panel) gives a much tighter confidence set. The smoothing parameter h = 0.3 and 
the two axes are from 1.5 to 5.5. 

panel; gold regions) in the sense that the size of confidence set is much smaller. The 
main reason is that both methods use the maximum over certain empirical processes 
but the two processes are defined on different function spaces. Method 1 only takes 
the supremum over a small function space J 7 , in which the index set contains only 
points on the level sets D^. However, method 2 takes the maximum over a large 
function space whose index set is the whole space IK. Thus, we expect the second 
method to have a wider confidence set. 

Despite the fact that method 2 yields a much larger confidence sets, it has some 
nice properties. First, method 2 is very simple: all we need is to compute the boot¬ 
strap distribution of supremum loss. Second, method 2 is connected to the meth¬ 
ods proposed in Mammen and Polonik (2013) and Jankowski and Stanberry (2012). 
Third, the confidence sets produced in method 2 are related to level sets with level 
A±mi_ Q . The last property makes it easy to visualize the confidence sets; see Section 
5.3. 

Here, we consider two simulated datasets to compare the coverage for confidence 
sets constructed using Hausdorff loss (method 1), L^ loss (supremum loss; method 
2), and scaled loss (remark 3). 

The first dataset is a three-Gaussian mixture. The data is generated from the 
following distribution: 

p(x) = ^</> 2 (x; h i, 0.3 2 I 2 ) + ^2(25 AD, 0.3 2 I 2 ) + ^ 2 (z; AD, 0.3 2 I 2 ), (38) 
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Figure 6: An example for density level sets and confidence regions for the old faithful 
data. Left: density level set (thick black contour denotes the specified level A). Right: 
90% confidence regions for the density level sets. We also have 90% confidence that 
(1) all the yellow regions have density above A (2) all green regions have density below 
A and (3) the level sets {x : ph(x ) = A} are within the blue regions. Note that the 
yellow and green regions are the collection of x that we reject H m 0 (x) and H out 0 (x) 
and simultaneously control the significance level at a = 10%. 



4 mixture, N = 1000 



-2 -1 0 1 2 3 4 5 

X 


Figure 7: Simulated data for comparing coverage of confidence sets. Left: Three 
Gaussian mixture dataset. Right: Four-mixture dataset with sample from Cadre 
et al. (2009). The red curves indicate the density level set we are interested in. 


18 












Table 1: Coverage for the three-Gaussian mixture data. 


Sample Size 

Hausdorff Loss 

-^oo 

Loss 

Scaled L Loss 

a = 0.90 

a = 0.95 

a = 0.90 

a = 0.95 

a = 0.90 

a = 0.95 

n = 500 

0.946 

0.983 

0.992 

0.998 

0.957 

0.984 

n = 1000 

0.944 

0.969 

0.991 

0.997 

0.946 

0.974 

n = 2500 

0.915 

0.969 

0.992 

0.998 

0.947 

0.978 


where </> 2 (x;/x, E) is the density to bivariate Gaussian with mean vector /i and Co- 
variance matrix E. I 2 is the 2x2 identity matrix and /ii = (0, 0) r , /i 2 = (1, 0) T , and 
/U 3 = (1.5,0.5) t . We use density level A = 0.3 and smoothing parameter h = 0.2. 
The corresponding level set D^ is the red curve in the left panel of Figure 7. We 
consider three different sample sizes, N = 500,1000, 2500 and compare the coverage 
of confidence sets using the three methods. The corresponding coverages are given in 
Table 1. As can be seen from Table 1, all the three methods have the desire nominal 
coverage. 

The second dataset is a four-mixture dataset from Cadre et al. (2009). The data 
is generated from the following distribution: 

6 

p{x) = fj>e, £<) (39) 

e=1 


with 7Ti = 7 t 2 = 7T3 = 7 t 4 = 1/5 and tt 5 = 7 t 6 = 1/10, and /j i = (—0.3, —0.3) T ,/i 2 = 
(3.0, 3.0) T , /i 3 = IM = (0, 3) T , n 5 = p 6 = (3, 0), and 


£i = 


S 3 — Ek — 


0.39 -0.28 

-0.28 0.39 

0.33 0 

0 0.01 


E 2 = 


E,i — E« — 


0,0 

(40) 

0.36 J 


0.01 0 \ 

(41) 


0 0.33 


We use density level A = 0.05 and smoothing parameter h = 0.2. The corresponding 
level set Df t is the red curves in the right panel of Figure 7. Again, we consider three 
different sample sizes N = 500,1000, 2500 and compare the coverage of confidence 
sets using the three methods. The corresponding coverages are given in Table 2. It is 
clear that all the three methods have the desire nominal coverage. Moreover, it can 
be seen from both Table 1 and 2 that the supremum loss method over covers. 
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Table 2: Coverage for the four-mixture data. 


Sample Size 

Hausdorff Loss 

T'oc 

Loss 

Scaled Loss 

a = 0.90 

a = 0.95 

a = 0.90 

a = 0.95 

a = 0.90 

a = 0.95 

n = 500 

0.991 

0.998 

1.000 

1.000 

0.950 

0.973 

n = 1000 

0.995 

0.998 

1.000 

1.000 

0.917 

0.951 

n = 2500 

0.936 

0.976 

1.000 

1.000 

0.976 

0.991 


4.4 Pointwise Hypothesis Tests for Level Sets 

The confidence sets developed in the previous section are related to two types of local 
hypothesis tests. For fixed density level A and an arbitrary point x, consider the tests 


of whether p(x) is greater or less than A: 




H\nfi(x) ■ Phi?) < A, 

H-m A (x) : p h (x) 

> A, 

(42) 

H outfi (x) : p h (x) > A, 

flout, a(*) : 

) < A. 

(43) 


When we only want to test just a few points, we can do local tests for each point and 
control the family-wise error rate to control the type 1 error rate. Usually, however, 
we are interested conducting the local test at many or even an infinite number of 
points (like a region), making it difficult to control type 1 error simultaneously. 

Inverting the confidence sets of the previous subsection gives a solution to this 
problem. Let 5%i_ a be a confidence set for Dh and let Lh and 14 be, respectively, 
the upper and lower lambda level sets. Then the decision rules 


Tin, n(x) = l{p h {x) > A A X £ S Ui i_ a ), 
Tout, n (*£) ^-{ph{x 0 ^ A A X ^ S n ,\— q). 


(44) 


control type 1 error simultaneously for all x £ IK. This simultaneous control is 
asymptotic in the sense that P(T out n (a;) = 1 Vx 6 L h ) = a + 0(r n ) for r n —>• 0 and 
similarly for Tj n , n (:r) and 14. 

In addition, inverting the regions where we cannot reject H\ n $(x) and H out ${x) in 
(44) yields confidence sets for L h and 14 . In Figure 6, a 90% confidence regions for 
the upper level set L h is the union of yellow and blue regions (T out>n (x) = 0). And 
a 90% confidence regions for 14, the lower level set, is the union of green and blue 
regions (Ti nin (x) = 0). Thus, with 90% confidence, all yellow regions are above A and 
the true high density regions should be contained by the yellow and blue regions. 

Remark 5. The two local tests described in (42) and (43) are relevant to the problems 
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Density Tree 




Figure 8: An example for comparing density trees to our visualization method. Left: 
a 3 dimensional dataset from Chen et ah (2014c). There are 5 clusters and four of 
them are connected by a tube-like structure. Middle: visualization using the density 
tree from DeBaCl (Kent et al., 2013). Right: visualization using our method, which 
preserves the real connections between clusters. It is easy to see the evolution of 
multiple level sets and how each cluster is connected to one another using our method. 
But it is hard to get the same information using density trees. 


of level-set clustering (Hartigan, 1975; Polonik, 1995; Rinaldo and Wasserman, 2010; 
Rinaldo et ah, 2012) and anomaly detection (Desforges et ah, 1998; Breunig et ah, 
2000; He et ah, 2003; Chandola et ah, 2009). Rejecting H- m .o(x) can be viewed as 
evidence that x belongs to a level-set cluster. And a point x where H out0 (x ) is 
rejected can be viewed as anomalous. 

Remark 6. Note that one can modify the local testing procedure to control the False 
Discovery Rate (Benjamini and Hochberg, 1995) rather than familywise error. 


5 Visualization for Multivariate Level Sets 

Level-set estimators can reveal useful information about a distribution, but beyond 
three dimensions, we cannot directly visualize the level sets, making the results diffi¬ 
cult to use. 

In this section, we propose a novel visualization technique density upper level sets 
in multidimensions. Any visualization entails some loss of information, but our goals 
are to preserve important geometric information about the sets, make the overall 
visualization easy to interpret, and give a method that is efficient to compute. Our 
method exploits the relationship between level-set clustering and mode clustering (see 
Figure 9). 

A current and commonly used visualization method for level sets is the density 
tree (Stuetzle, 2003; Klemela, 2004, 2006; Kent et ah, 2013; Balakrishnan et ah, 
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2013). This method considers several density levels, Ai < ■ • • < A k, and computes 
the number of connected components for the upper density level set at each level. 
As we increase the density level, some connected components may vanish and others 
may split into additional components. In the typical case when the underlying density 
function is a Morse function (i.e., Hessian at critical points is non-degenerate) (Morse, 
1925, 1930; Milnor, 1963), the connected component disappears when the density level 
is above the maximum density value over the component and splits only if the density 
level passes the density value of some saddle points within the component (Klemela, 
2009). The vanishing and splitting of components as the level changes produces a 
tree structure. The density tree uses this tree structure as a visualization of the level 
sets. We refer to Klemela (2009) for more details. 

Density trees display primarily topological information about the underlying den¬ 
sity function (Stuetzle, 2003; Kent et ah, 2013) but need not preserve or impart other 
features that may be of interest. Extensions have been proposed that would endow 
the tree with additional information about the distribution (Klemela, 2004, 2006, 
2009), but this is difficult to do with multiple features and can make the visualization 
difficult to interpret. See Figure 8 for an example. 

In this section, we propose a novel technique that visualizes several density (up¬ 
per) level sets that preserve some geometric information and that are very easy to 
understand. Our method is based on the relationship between level set clustering and 
mode clustering (see Figure 9). Note that in this section, we will focus on density 
upper level sets. 

Our method complements existing tree-based methods. It combine two cluster¬ 
ing techniques - level-set and mode clustering - to produce a simple and intuitive 
visualization. 

5.1 Mode Clustering and Density Upper Level Sets 

Mode clustering (in particular, the method known as mean-shift clustering) was intro¬ 
duced in Fukunaga and Hostetler (1975), where data points are clustered by following 
the gradient flows of the kernel density estimator from each point to a local mode. The 
clusters are the “basins of attraction” of each local mode (Fukunaga and Hostetler, 
1975; Cheng, 1995; Comaniciu and Meer, 2002). More generally, given a smooth 
Morse function p, mode clustering works as follows (Li et ah, 2007; Chacon, 2012; 
Chen et ah, 2014c). We form a partition of IK based on the gradient field g = 'S/p. 
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Figure 9: An example for density level set and mode clustering. Left: density level 
set (thick black contour denotes the specified level); four red dots are the local 
modes. Middle: basins of attraction for each local modes intersecting the density 
level set. Right: graph representation. There are two connected components. The 
large connected components contain three local modes 777i,i, 7771 , 2 , 7711,3 with 7711 , 1 , 7711,2 
and 771 i,2,771i, 3 are connected. 

For each x £ IK, we define a gradient flow n x : [0, 00 ) 1 —> IK: 


ir»(0) = x, < (t) = g(T x (t)). 


(45) 


That is, n x (t) starts at x and moves along the gradient of p. We define the destination 
for 7 T x (t) as dest(:r) = lim^oo 7 r x (t). Let A 4 be the collection of all local modes of p. It 
can be shown that dest(x) e A4 except for a set of x’s in a set B with Lebesgue measure 
0 (this set corresponds to the boundaries of clusters). For each mode rrij £ A4, we 
define its basin of attraction as 


Aj = {x £ IK : dest(x) = rrij}. (46) 

The regions Mi, ■ ■ • ,Ak are the clusters generated by mode clustering. 

Now we recall three facts about an upper density level set L = {x : p(x) > A} 
(c.f. Figure 9 left and middle panels): 

1. L can be decomposed into connected components L = Ut 1 C„ where the Cg are 
disjoint, connected compact sets under regularity conditions. 

2. Each Ci contains at least one local mode. 

3. If Ci contains s local modes, then Ci = C\ v U • • • U C\ s U B, where C^ is the basin 
of attraction for a local mode mij intersected with the level set L and B are the 
boundaries of the basins which has 0 Lebesgue measure. Namely, ■ = LnAk 
for some k. 
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Figure 10: A visualization example in multivariate cases. We use the same dataset as 
Figure 8 but add an additional Gaussian noise to each data point to make it a higher 
dimensional dataset. Left: a 6 dimensional dataset (add 3 additional dimensional 
noises). Right: a 10 dimensional dataset (add 7 additional dimensional noises). 

Thus, the upper level sets are covered by the basins of attraction of the local modes 
(the uncovered regions have Lebesgue measure 0 so we ignore them for visualization). 

We then create a graph G = ( V , E ) with each node corresponding to a local mode 
within Lh and each edge representing a connection between local modes to represent 
a level set. Specifically, we add an edge to a pair of local modes (mej, m^k) when the 
corresponding basins C\ rj and C\ k shares the same boundaries, i.e. C^,- D C\ k ^ 0. 
Two local modes have an edge only if they are in the same connected component and 
their shared boundaries are also in the upper level set. Figure 9 provides an example 
illustrating these parts. 

5.2 Visualization Algorithm 

We construct visualization using Algorithm 1. First, we perform mode clustering to 
obtain local modes mi, ■ • ■ , rn k G M d . We use the mean shift algorithm (Fukunaga 
and Hostetler, 1975; Cheng, 1995; Comaniciu and Meer, 2002) to find the modes 
of density estimate. Second, we perform multidimensional scaling (MDS; Kruskal 
1964) to all the modes to map them onto a 2D plane. Let m\, ■ ■ ■ , rri\ e M 2 be the 
corresponding locations after MDS. Third, we assign local mode m£ an index 

re( A) = — n e ( A) = ^ 1 (dest(Ab) = rn t Ap h (X t ) > A) , (47) 
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where ng(A) is the number of data points that are assigned to rri( by mode clustering 
and have (estimated) density being greater or equal to A. Fourth, if ry(A) > 0, we 
create a circle around each m\ with radius proportional to ry(A) If ry(A) = 0, we 
ignore this local mode; this occurs if the (estimated) density of the mode (and its 
basin of attractions) lies below A. Fifth, we connect two local modes m\ and 
if they belong to the same connected component of the estimated upper level set 
Lh — {x : Ph{x ) > A} and their basins of attraction intersect Lh share the same 
boundary (see e.g. Figure 9 middle and right panel). Note that this step might 
be computationally difficult in practice. An efficient alternative is to examine the 
shortest distance between two connected components; if the distance is sufficiently 
small, we claim these two components share the same boundary. Finally, we set the 
width of the line connecting rrq and rrij to be proportional to rg( A) + rj( A). 

Given several density levels, Ai < • • • < A k, we can overlay the visualization from 
the previous paragraph from Ai to A k to create a “tomographic” visualization of the 
clusters. This gives a visualization for the density level sets. Figure 10 shows an 
example for visualizing level sets for a 6-dimensional and a 10-dimensional simulation 
datasets at different density levels. This dataset is from Chen et al. (2014c). 

Algorithm 1 Visualization for a single level set 

Input: Data {Ad, ...X n }, density level A, smoothing parameter h 

1 . Compute the kernel density estimator ph- 

2. Find the modes rn \, ■ • • , m*, of p\ (one can apply the mean shift algorithm). 

3. Apply multidimensional scaling to the modes to project them into M * 1 2 3 4 5 6 7 ; denote 

the corresponding locations as m\, ■ ■ ■ , m\. 

4. For each local modes mg, we assign it an index (47) 

re{ A) = ne(X) = ^ 1 (dest(Ad) = m e Ap h (Xi) > a) . 

1=1 

5. If r^(A) > 0, we create a circle around m\ with radius in proportional to ry(A). 

6 . We connect two local modes m\ and mj if 

(1) they belong to the same connected component of L h , the estimated upper 
level set at level A, and 

(2) their basins of attraction above level set share the same boundary. 

7. Adjust the width for the line connecting rri\ and mj to be in proportion to 

r/(A) +c(A). 
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Figure 11: A visualization for confidence sets of an upper set with level A. This is the 
6 dimensional dataset of Figure 10. We pick the density level A = 0.4 x sup x ||ph(a;)|| 
and display confidence levels under a = (50%, 80%, 90%, 95%). 

5.3 Visualizing Level Sets with Confidence 

A modified version of our algorithm allows us to visualize multivariate confidence sets 
for an upper level set at a given level A. In particular, we visualize the confidence 
set for the level sets produced by Method 2 (supremum loss, see Section 4.2) . The 
advantage of Method 2 here is that this confidence set under different cbs is just the 
density level set at different A’s. This makes it easy to visualize K distinct confidence 
sets for pre-specihed levels op < • • • < olk- 

Recall that M* = sup xeK |p£(a;) — ph(x)\ is the bootstrap supremum deviation 
for the KDE and rh\- a = F^\ (1 — a) is the quantile. When we want to visualize 
confidence sets for the upper level set Lh at different significance levels, we pick 

Ai = A — ,\ K = A-mi_ QK (48) 

and use the visualization algorithm to create a tomographic visualization. Figure 11 
provides an example for visualizing confidence sets for upper level set using the 6 
dimensional simulation data in Figure 10. 

Remark 7. At the cost of additional computation, we can also use Method 1 (Haus- 
dorff loss) instead of Method 2, combining it with the basins of attractions to visualize 
the confidence sets. We use method 1 to construct the inner/outer confidence sets and 
then we find the connected components and partition it by the basins of attraction 
for local modes and use multidimensional scaling to visualize it in low dimensions. 
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6 Discussion 


In this paper, we derived the limiting distribution for smoothed density level sets 
under Hausdorff loss. This result immediately allows us to construct confidence sets 
for the smoothed level set. We developed two bootstrapping methods to construct the 
confidence sets, and we showed that both methods are consistent. These confidence 
sets can be inverted to construct multiple local tests of whether a point’s density 
value is above or below a given level, which has application to related problems such 
as anomaly detection. Finally, we developed a new visualization method that is 
informative and interpretable, even in multidimensions. 

Although we focused on density level sets in this paper, our methods, including 
confidence sets and visualization, can be applied to a more general class of problems 
such as kernel classifier, two-sample tests, and generalized level sets (See Mammen 
and Polonik 2013 for more details.). 
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7 Proofs 

Theorem 8 (Theorem 2 in Cuevas et al. (2006)). Assume (Kl-2) and (G), then we 
have 

Haus (p h , Dhj = 0(\\p h - p h || 

0,max) • 
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Theorem 9 (Talagrand’s inequality; version of Theorem 12 in Chen et al. (2014a)). 
Assume (Kl-2), then for each t > 0 there exists some n o such that whenever n > no, 
we have 

p (lift, - ft,II;,™* >t)<(t+ lje-'"' 1 "'- 4 *, 

for some constant A\ and t = 0,1, 2. Moreover, 

E (11^ -^Il2,max) = 0 • 


Proof for Lemma 1. We first prove the lower bound for reach (D h ) and then 
we will prove the additional assertions. 


Part 1: Lower bound on reach. We prove this by contradiction. Take x near 
D h such that 


d(x, D h ) < 


f go \ 

\\Ph II 2 , max / 


(49) 


We assume that x has two projections onto Dh, denoted as b and c. 

Since b, c 6 D h , p h (b ) — A = Ph(c) — A = 0 so that Ph(b) — p h (c) = 0. Now by 
Taylor’s theorem 


|| (.b - c) T Wp h (b) || = || p h (b) - p h (c ) -(b- cfWphib) || 

X 

< jll* - c|| ||ft,||2,mm- 


(50) 


By the definition of projection, we can find a constant 6 1 such that x — b = 
tb^Phip). Together with (50), 


2|(6 - cf(x — 6)| = 2|(6 — c) T Vp h (b)t b \ 

< ||(6 - c) T Vp h {b)\\\t b \ (51) 

< IKIb.maxll^ - c|| 2 |tfc|. 

Since both b and c are projection points from x onto D b , 


x — 6|| 
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Thus, we have 


0 


x — b || 2 

||6 — c|| 2 + 2(6 — c) T (x — b ) 



(52) 


> ||6 - c|| 2 - ||p/i|| 2 ,max|| 6 - c|| 2 |4 
= ||6 - c|| 2 (1 - ||pft || 2,max1)■ 


2, max 


Recall that d(x, D h ) < n—if 2 — and by Taylor’s theorem, 

\\Ph || 2,max 


,max 



> d(x, D h ) = ||x — 6|| = \\t b Vp h (b)\\ = |t fe |||Vp fe (6)|| > \t b \g 0 (53) 


so that |4|||ph||2,max < 1- Note that the lower bound g 0 in the last inequality is 


because d(x,Dh ) < y so it follows from assumption (G). Plugging in this result into 


the last equality of (52), we conclude that ||6 — c|| = 0. This shows b = c so that 



we have a unique projection. Thus, whenever d(x, D h ) < 


unique projection onto D b and thus we have proved the lower bound on reach. 

Part 2: The three assertions. The first assertion is trivially true when 
||ph — ?11 2 max is sufficiently small since assumption (G) only involves gradients (first 
derivatives). 

The second assertion follows from the lower bound on reach. By assertion 1, 
(G) holds for q. And the lower bound on reach is bounded by gradient and second 
derivatives so that we have the prescribed bound. 

The third assertion follows from Theorem 1 in Chazal et al. (2007) which states 
that if two d — 1 dimensional smooth manifolds M\ and M 2 have Hausdorff distance 
being less than (2 — */2) min{ reach (Mi), reach(M 2 )}, then M\ and M 2 are normal 
compatible to each other. Now by Theorem 8, the Hausdorff distance between D b 
and D(q) is at rate 0(\\ph — <z||i,max) so that this assertion is true when ||p^ — g|| 2 ,max 
is sufficiently small. □ 

Proof of Lemma 2. Let x e D h . We define n(x) e D h to be the pro¬ 
jected point onto Dh- By Lemma 1 and Theorem 8, when \\ph — P/i|| 2 max 0) 
Haus (Dh,Dh) —> 0 so that n(x) is unique. Thus, we assume n(x) is unique. 

Now since n(x) G Dh and x G Dh, ph(Jl(x)) — ph(x ) = 0. Thus, by Taylor’s 
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theorem 


Ph(x) - Ph(x) = p h (x) - p h {U(x )) 

= (X - n(x)) T (Vp ft (n(x)) + o P (||x - n(x)||)). 


(54) 


Note that x — H(x) is normal to Dh at II(x) so that it points toward the same direction 
as Vp/,,(n(.x')). Thus, (54) can be rewritten as 

Ph(x) -Ph{x) = ||x - n(x)||(||Vp fc (n(x))|| + 0 P (||x - n(aj)||)). (55) 

By Taylor’s theorem, Vp/i(IT(x)) is close to Vph(x) in the sense that 


Vp ft (n(x)) = Vp h (x) + 0(\\p h - p h H^max)- ( 56 ) 

In addition, 0(||x — fl(x)||)) is bounded by 0(Haus(D/ l , Dh,)) which is at rate 0(\\ph — 
p^Htmax) due to Theorem 8. Putting this together with (55), we conclude 


Ph{x)-p h {x) = ||x — II(x)|| (||Pft(x)|| + 0(\\ph — Phllgmax)) 
= d(x,D h )(\\p h (x)\\ +0(\\p h -p h \\l max )). 


(57) 


Note that the left hand side can be written as 


Ph(x)~p h (x) = ("—r^) — yyE (K 

yny , yny j ^ \ h ) h d \ \ h )) y/nh d 


G n (/ X ), (58) 


where f x (y) = K (\^)- After plugging (58) into the left hand side of (57), dividing 
both side by \\p h (x)\\ and setting f x (y ) = , we obtain 


^=j&n(fx)-d(x,D h ) 
d(x, D h ) 


= 0(\\p h -Ph\\l, max )- 


(59) 


This holds uniformly for all x G Dh and note that the definition of J- is 




Vh d \\S7p h {x)W V h 


K ( - . !l | : x G D h 
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So we conclude 


sup 

x£ D h 


^p G n(fx) ~ d(x,D h ) 


d(x, D h 


= 0(\\p h ~ Ph\\l max )- 


□ 


Proof for Theorem 3. The proof for Theorem 3 follows the same procedure 
as the proof of Theorem 6 in Chen et al. (2014b). The proof contains two parts: 
Gaussian approximation and anti-concentration. 

Part 1: Gaussian approximation. Basically, we will show that 
Vn¥Haus(D h ,D h ) «sup|G n (/)| «sup|B(/)|, 

feT feT 

where B is a Gaussian process defined in (13) of the original paper. 

First, when || ph — Ph\\ is sufficiently small, Dh and Dh are normal compatible to 
each other by Lemma 1. Then by the property of normal compatible, 


sup d(x,D h ) = Haus (D h ,D h ). 

x&D h 


( 60 ) 


Thus, the difference 


Vnh d Haus(D h , D h ) - sup |G n (/)| 

— 

Vnh d sup d(x,Dh) — sup |G n (/)| 

/6-F 


x&D h /eJ 7 


sup x&Dh 


< 


= sup 

z6 D h 


^^"(/i) d(x,Dh) 


V nh d 

y^pG n (/ x ) — d(x, Dh) 


(61) 


d(x,D h ) 
= 0{\\p h -Ph||I imax ). 


Op(l) 


Note that the last two inequality follows from the fact that d(x, Dh) < Op( ^= ). By 
Theorem 9 the above result implies, 


P 


Vnh d Uaus(D h , D h ) — sup |G n (/)| 

feT 


> t) <2e 


-tnh d+2 A 2 


( 62 ) 


for some constant A%. 

Now by Corollary 2.2 in Chernozhukov et al. (2014c), there exists some random 


36 


























variable B = supj eJF |B(/)| such that for all 7 G (0,1) and n is sufficiently large, 


P 


sup |G n (/)| - B 
/e-7 7 


> A, 


log 2/3 W \ 

7 1 / 3 (?rh d ) 1 / 6 I 


< Ai7- 


(63) 


Note that this result basically follows from the same derivation of Proposition 3.1 in 
Chernozhukov et ah (2014c) with the fact that g = 1 in their definition. 

Combining equations (62) and (63) and pick t = 1 /\/nh d+2 , we have that for n is 
sufficiently large and 7 G (0,1), 


P 


Vnh d Haus(D h , D h ) — B 


> A c 


log : 


. 2 / 3 , 


n 


+ 


7 1 / 3 (nh d ) 1 / 6 a/ nh d+2 


^ -f- 2c 


—Vn/i d+2 A2 


(64) 


Part 2: Anti-concentration. To obtain the desired Berry-Esseen bound, we 
apply the anti-concentration inequality in Chernozhukov et al. (2014c) and Cher¬ 
nozhukov et al. (2014a). 

Lemma 10 (Modification of Lemma 2.3 in Chernozhukov et al. (2014c)). Let B = 
sup/gj- |B(/)|, where B and T are defined as the above. Assume (Kl-2) and that there 
exists a random variable Y such that P(|F — B| > rj) < 5{rf). Then 


sup |P(F < t) — P (B < t)\ < A 5 E(B )?7 + S(rj) 

t 

for some constant A 5 . 

ft is easy to verify that assumptions (Kl- 2 ) imply the assumptions (Al-3) in 
Chernozhukov et al. (2014c) so that the result follows. Note that in the original 
Lemma 2.3 in Chernozhukov et al. (2014c), E(B) should be replaced by E(B) -flog 77 . 
However, E(B) = 0(y/logn) due to Dudley’s inequality for Gaussian process (Van 
Der Vaart and Wellner, 1996) and later we will End that log 7 is also at this rate so 
we ignore log 77 . 
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From Lemma 10 and equation (64), there exists some constant A§ such that 


sup 

t 


F^Vnh d Haus(D h , D h ) < t j — P |^sup | 


< t 


< ae(b) i /i, i° g2/ ' ( rL + 1 


+ + 2e~' /n ^ M (65) 


d A 6 A 3 


Now pick 7 = and use the fact that and 2e AnJA+^M 

faster than the other terms; we obtain the desired rate. □ 


7 1 /3( n / z rf)l/6 y/nhd+2 

+ ,/]W) + A 47 + 2e-^*. 

1 / 3 (nh d ) 1 / 6 V nh d+2 


converges 


PROOF for Theorem 4. This proof follows the same strategy for the proof of 
Theorem 7 in Chen et ah (2014b). We prove the Berry-Esseen type bound first and 
then show that the coverage is consistent. We prove the Berry-Esseen bound in two 
simple steps: Gaussian approximation and support approximation. 

Let X n = {(W, • • • , X n ) : \\p h — p^l^max — ho} for some small r] 0 so that whenever 
our data is within X n , (G) holds for ph. By Lemma 1, such an rjo exists and by 
Theorem 9 we have P(T n ) > 1 — 3 -nh d+4 A a f or some constant A 0 . Thus, we assume 
our original data Ad, • • • , X n is within X n . 

Step 1: Gaussian approximation. Let P n and P* be the empirical measure 
and the bootstrap empirical measure. A crucial observation is that for a function 

fM = K(p?), 

P n(U) = J K (~J~) = hi Pk( x ). (66) 

Also note 

K(U) = Jl< (dd) (67) 

Therefore, for the bootstrap empirical process G* = -^/n(P* — P), 

Ph( x ) - M x ) = —^G* (/*). (68) 

Thus, if we sample from ph and consider estimating ph by p* h , we are doing exactly 
the same procedure of estimating ph by ph■ Therefore, Lemma 2 and Theorem 3 hold 
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for approximating Ha us ( 74 /g D h ) by a maxima for a Gaussian process. The difference 
is that the Gaussian process is defined on 


T n = < f x (y ) = 7 - K ( ^ : x E Dh\ (69) 

\ V^\\Vp h (x)\\ \ h J h f [ J 

since the “parameter (level sets)” being estimated is D h (the estimator is D £). Note 
that T n is very similar to J- except the denominator is slightly different and the 
support Dh is also different from Dh- That is, we have 


sup 

t 


P^Vni^Haus (D* h ,D h ) < t 


X u ---,X n 


- P ( sup |B„(/)| < t 

J 6-Fn 


X \, ■ ■ ■ , X n 


< o 


, 7 \ 1/8 N 

log n x 
nh d 


(70) 


where B n is a Gaussian process on T n such that for any h, f 2 E T n , 


Ah) i Xn) — 0, Cov(B n (/ 1 ), B(/ 2 )|Ad, • • • , X n ) = - V h(Xi)f 2 (X i ). 

n z —^ 

(71) 


2=1 


Step 2: Support approximation. In this step, we will show that 


sup |B n (/)| » sup |B n (/)| » sup |B(/)|. (72) 

feT„ f&F f&F 

The first approximation can be shown by using the Gaussian comparison lemma 
(Theorem 2 in Chernozhukov et al. (2014b); also see Lemma 17 in Chen et al. (2014b)). 
We do the same thing as Step 3 in the proof of Theorem 8 in Chen et ah (2014b) 
so we omit the details. Essentially, given any e > 0, we can construct a pair of 
balanced e-nets for both T and T ni denoted as {< 71 , ■ • • ,gx} and {< 7 ^, - - • , 5 ^} so 
that rnaXj \\gj - g™\\* m ax = 0(\\p h - p h \\ g max )- Then this e-net leads to 


sup 

t 


P 


sup |B n (/)| < t 

/S^n 



-P 


sup |B n (/)| < t 
f 6 ^ 



< O ((llPh -Phllgmax) 173 ) • 


(73) 


The difference between supy gjF |B n (/)| and supj g j- |B(/)| is small since the these two 
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Gaussian processes differ in their covariance but as n —>■ oo, the covariances converges 
at rate 1 /y/n so that we can neglect the difference between them. Thus, combining 
(70) and (73) and the argument from previous paragraph, we conclude 


sup 

t 


P 


( Vnh d Haus(Dl, 




D h ) < t 


, X n 


P 


sup 

J^F 


< t 



(74) 

Now comparing the above result to Theorem 3 and using the fact that the first big-0 
term dominates the second term (the first is of rate —1/8 for n but the second term 
is at rate —1/6 by Theorem 9), we conclude the result for first assertion. 

For the coverage, let W n = Haus (Dh,Dh) and w n ,i-o = Fw ri (1 — «)• Since Dh C 
D h © Haus (D h , D h ), we have 


P (Dh C D h © w n> i_a) = 1 - a. 


(75) 


Now by the first assertion, the difference for w n ^ a and the bootstrap estimate w* 1 _ c 

l/8\ 

, which completes the proof. 


differs at rate O 
□ 


log 7 n 
nh d 


40 








